ABSTRACT

Currently, the vector space model algorithm is widely implemented in document search features
because of its reliability in retrieving information. One such application is the search for
verses of the Qur'an based on their translation. However, if the phrase or word used in the query
differs from the words in the documents in the database (even though it has the same meaning), the
system will not display the verse. Because the Qur'an has a very deep meaning, an interpretation
of each verse is needed. Therefore, this research focuses on implementing the vector space model
(VSM) algorithm for searching verses and hadiths on science and technology by using the discussion
of these verses or hadiths as parameters. The test results obtained with 20 keyword samples using
the recall metric were 81%, with an average time of 2.24 seconds.

Keywords: Hadith, Qur'an, Science and technology, Searching, Vector space model

This is an open access article under the CC BY-SA license.

1. INTRODUCTION 

The Qur'an is the essence of all science. However, the knowledge contained in the Qur'an is still in 
the form of seeds and principles. The Qur'an contains the principles of all science, including technology and 
knowledge of the universe. The Qur'an contains various levels of definition and layered meaning for all 
readers [1]. Sunnah or hadith, according to El-Naggar [2], is everything that comes from the Prophet 
Muhammad in the form of words, behavior, decisions, character, and sirah (biography of the 
Prophet SAW), both before and after he was sent as a prophet and apostle. The sunnah of the Prophet is the 
second guide to life after the Qur'an. 

Information technology facilitates human needs in almost all aspects of life. Smartphones, search 
engines, and television are real examples of the application of information technology [3]. Search engines like 
Google, Bing, Yahoo, and Ask adopt a system called the information retrieval system in their search 
process [4]. This system accepts keywords or other input from the user and returns documents based on their 
suitability to the keywords [5]. 

The vector space model (VSM) algorithm is an information retrieval model that represents documents 
and queries as vectors in a multidimensional space. The similarity between them can be measured by calculating 
the angle formed by the document and query vectors [4]. With this algorithm, we can use words, phrases, or 
sentences as the query [6]. The structure of the phrases and sentences does not have to be exactly the same as 
in the documents in the database. VSM adopts a similarity measure for matching documents against the user 
query and ranks the documents from the highest score to the lowest. The documents and the query are assigned 
weights using the term frequency and inverse document frequency (TF-IDF) method [7]. There are many variants 
of the TF-IDF model for obtaining the weights, including term frequency, classical TF-IDF, normalized TF-IDF, 
and sub-linear normalized TF-IDF [8]. 

Currently, many hadith and Qur'an applications include search features. However, the 
technique is still limited to matching the search word (string matching), so results often do 
not appear when the query is given in the form of a sentence or phrase [8]. In addition, until now there has been 
no application or system that specifically provides information on the verses of the Qur'an and hadiths relating 
to science and technology [9], even though the interpretation of Shaykh Thantawi Al-Jawahir states that 
the Holy Qur'an contains more than 750 kauniyah verses (verses about the universe) [10]. 

Many search techniques are currently used by researchers. One of the most popular is word 
matching or string matching, which matches a certain word pattern against a long sentence or 
text [11]. However, if the wording is not the same as in the database, nothing will be found. 
Another currently popular technique for finding information is the vector space model algorithm, 
which adopts word weighting. With this technique, the wording does not have to be exact. Even if 
the user enters the words in a different order from the text in the database, the system will 
still return the information. In a previous study on the implementation of the VSM algorithm for 
searching Al-Qur'an verses [12], the parameters used were the translation of the Al-Qur'an verses, 
with query expansion to align the keywords with the data in the database. However, if the user 
enters a different query that has the same meaning but is not in the database, the system will not 
be able to return the information sought. Therefore, this research uses as its parameters the 
discussion from the books of commentaries on the Qur'an and the Hadith, whose language structure 
is closer to daily life. 
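
The contrast between string matching and word-based matching described above can be illustrated with a small Python sketch. It is not taken from the system described in this paper; the function names `substring_match` and `shared_terms` are hypothetical, and the bag-of-words overlap shown here is only the simplest form of the idea behind VSM (no weighting yet).

```python
from collections import Counter


def substring_match(query: str, document: str) -> bool:
    """Exact string matching: the query must appear verbatim in the document."""
    return query.lower() in document.lower()


def shared_terms(query: str, document: str) -> int:
    """Bag-of-words overlap: count query tokens found in the document, ignoring order."""
    query_counts = Counter(query.lower().split())
    document_counts = Counter(document.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    return sum((query_counts & document_counts).values())


document = "demi langit yang mempunyai gugusan bintang"
query = "bintang gugusan langit"  # same words as the document, different order

print(substring_match(query, document))
print(shared_terms(query, document))
```

With the reordered query, exact matching finds nothing, while the word-based comparison still sees all three query words in the document.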


2. RESEARCH METHOD 

In this paper, the research follows the steps shown in Figure 1. The first step is to enter 
keywords (words, phrases, or sentences). The keywords are then preprocessed to obtain a list 
of word tokens, for which the TF-IDF weights are calculated. The resulting query and document 
weights are used in the VSM computation [13-16]. If the cosine value is greater than 0, a list of relevant 
data is shown; otherwise, a message that no data was found is displayed. 
Figure 1. Vector space model flowchart 
This research uses a software development method called the prototype model [17-19]. The 
application is a web-based application built using the Django framework and Python. To 
access it, users need a browser [20]. The application architecture is illustrated in Figure 2. 
First, the client sends a URL request through the browser. This URL contains the application pathname. If 
the pathname is valid, the HTML page from the templates and the data from the Model in the database are sent 
to the web server and displayed. 


Figure 2. Application architecture 


3. RESULTS AND DISCUSSION 
3.1. Collecting text data of Qur’an and hadith 

The verse data used in this research are Indonesian translations of the Qur’an and hadith texts 
obtained from the Mushaf Al-'Alim science edition [1], the book of scientific miracles on earth and 
space [5], and the three-book series proving science in the sunnah by El-Naggar [2]. 


3.2. Preprocessing 

As described in the flowchart, this algorithm uses the weights of the query and the 
documents to find their resemblance [21]. Before the algorithm calculation, the documents and the query 
must go through preprocessing (text processing) so that the weight values can be obtained. This stage is 
divided into 4 parts, namely case folding, filtering, stemming, and tokenizing [22]. 
Given 4 documents and an input query Q: 
D1 = Sesungguhnya manusia berada dalam kerugian 
D2 = Demi langit yang mempunyai gugusan bintang 
D3 = Aku tidak akan menyembah apa yang kamu sembah 
D4 = Sesungguhnya mereka adalah orang-orang yang merugi 
Q = Manusia yang rugi 
- Case folding 

The first stage in preprocessing is case folding, which converts all letters to lowercase [23]. 
D1 = sesungguhnya manusia berada dalam kerugian 
- Filtering 

This stage removes stopwords (words that carry little meaning) and punctuation from each document, 
such as the words 'adalah', 'yang', and 'apa' [24]. 
D1 = sesungguhnya manusia kerugian 
- Stemming 

The stemming stage converts words into their base form by removing affixes [25]. 
D1 = sungguh manusia rugi 
- Tokenizing 

The last stage is tokenizing, which separates each word into tokens [26]: sungguh, 
manusia, rugi, langit, gugusan, bintang, sembah, orang. 
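
The four preprocessing stages can be sketched in Python as follows. This is only an illustration for the four example documents, not the implementation used in the system: the stopword set `STOPWORDS` and the stem lookup `STEM_LOOKUP` are hypothetical and hand-picked so that the output matches the tokens listed above, whereas a real system would use a full Indonesian stopword list and stemmer. For convenience, the sketch splits the text into words before filtering rather than at the end.

```python
import re

# Hypothetical, deliberately tiny resources that only cover the example documents.
STOPWORDS = {
    "adalah", "yang", "apa", "berada", "dalam", "demi",
    "mempunyai", "aku", "tidak", "akan", "kamu", "mereka",
}
STEM_LOOKUP = {
    "sesungguhnya": "sungguh",
    "kerugian": "rugi",
    "merugi": "rugi",
    "menyembah": "sembah",
    "orang-orang": "orang",
}


def preprocess(text):
    """Return the list of stemmed tokens for one document or query."""
    text = text.lower()                                # case folding
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text)    # split into words, dropping punctuation
    words = [w for w in words if w not in STOPWORDS]   # filtering (stopword removal)
    return [STEM_LOOKUP.get(w, w) for w in words]      # stemming via lookup table


documents = {
    "D1": "Sesungguhnya manusia berada dalam kerugian",
    "D2": "Demi langit yang mempunyai gugusan bintang",
    "D3": "Aku tidak akan menyembah apa yang kamu sembah",
    "D4": "Sesungguhnya mereka adalah orang-orang yang merugi",
}
tokens = {name: preprocess(text) for name, text in documents.items()}
query_tokens = preprocess("Manusia yang rugi")

for name, toks in tokens.items():
    print(name, toks)
print("Q", query_tokens)
```

A lookup table is used instead of affix-stripping rules only to keep the sketch short and dependency-free.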
3.3. Word weighting (TF-IDF) 

In Table 1, the Term column lists the words that have gone through the preprocessing stage. Next, 
we count how often each term appears in each document (TF); if a document does not contain the term, 
the value is zero. Column Q is the sample query to be searched, that is, 'Manusia yang rugi' (human loss). 
After that, we calculate the frequency of occurrence of each term in the whole document collection 
(DF) [27]. To determine how important a term is to the whole collection, we calculate the inverse 
document frequency (IDF) of each term with (1) [28], where N is the number of documents. 
Finally, we obtain the term weight by multiplying the term frequency by the IDF [29]. 

$$idf_i = \log\left(\frac{N}{df_i}\right) \qquad (1)$$


Table 1. TF-IDF weightings 

Term      TF                    DF  IDF        W (TF*IDF)
          Q   D1  D2  D3  D4        log(N/DF)  Q        D1       D2       D3       D4
Sungguh   0   1   0   0   1     2   0.30103    0        0.30103  0        0        0.30103
Manusia   1   1   0   0   0     1   0.60206    0.60206  0.60206  0        0        0
Rugi      1   1   0   0   1     2   0.30103    0.30103  0.30103  0        0        0.30103
Langit    0   0   1   0   0     1   0.60206    0        0        0.60206  0        0
Gugusan   0   0   1   0   0     1   0.60206    0        0        0.60206  0        0
Bintang   0   0   1   0   0     1   0.60206    0        0        0.60206  0        0
Sembah    0   0   0   2   0     2   0.30103    0        0        0        0.60206  0
Orang     0   0   0   0   1     1   0.60206    0        0        0        0        0.60206
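
The weighting in Table 1 can be reproduced with a short Python sketch. Two points are assumptions read off the table rather than stated by the text: the logarithm is base 10 (log(4/2) = 0.30103), and DF is taken as the total number of occurrences of a term across all documents, which is why Sembah has DF = 2 although it occurs only in D3. The more common definition of DF counts the documents containing the term. The function name `tfidf_weights` is hypothetical.

```python
import math
from collections import Counter

# Token lists obtained from the preprocessing stage.
docs = {
    "D1": ["sungguh", "manusia", "rugi"],
    "D2": ["langit", "gugusan", "bintang"],
    "D3": ["sembah", "sembah"],
    "D4": ["sungguh", "orang", "rugi"],
}
query = ["manusia", "rugi"]


def tfidf_weights(docs, query):
    """Return (idf, weights); weights maps "Q" and each document name to {term: tf * idf}."""
    n_docs = len(docs)
    tf = {name: Counter(tokens) for name, tokens in docs.items()}

    # Assumption matching Table 1: DF is the total count of a term over all documents.
    df = Counter()
    for counts in tf.values():
        df.update(counts)

    idf = {term: math.log10(n_docs / df[term]) for term in df}
    weights = {
        name: {term: count * idf[term] for term, count in counts.items()}
        for name, counts in tf.items()
    }
    # Query terms that never occur in the collection get no weight.
    weights["Q"] = {
        term: count * idf[term]
        for term, count in Counter(query).items()
        if term in idf
    }
    return idf, weights


idf, weights = tfidf_weights(docs, query)
for name in ["Q", "D1", "D2", "D3", "D4"]:
    print(name, {term: round(w, 5) for term, w in weights[name].items()})
```

Only nonzero weights are stored, so each vector is a sparse dictionary rather than a full row of the table.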


3.4. VSM algorithm process 

In the first step of this algorithm, we calculate the length of the document vectors and of the 
query vector using (2) and (3): each weight is squared (Table 2, columns W_Q^2 through W_D4^2), the 
squared values of the query and of each document are summed (Table 2, Sum row), and the square root 
of each sum is taken (Table 2, SQRT row). In the second step, each term weight in each document is 
multiplied by the corresponding query weight (Table 2, columns W_Q*W_D1 to W_Q*W_D4), and the 
products are summed for each document (Table 2, Sum W_Q*W_D values). To get the similarity of each 
document according to (4), the sum of products is divided by the product of the square-root values 
of the query and of the document. The final ranking of the documents is given in the Rank row. The 
documents that are relevant to the query are documents one and four (their values are not equal to 
zero), and the document with the highest similarity to the query is document one (D1). 











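$$|d_j| = \sqrt{\sum_{i=1}^{t} (w_{ij})^2} \qquad (2)$$

$$|q| = \sqrt{\sum_{i=1}^{t} (w_{iq})^2} \qquad (3)$$

$$sim(q, d_j) = \frac{\sum_{i=1}^{t} w_{iq} \, w_{ij}}{\sqrt{\sum_{i=1}^{t} (w_{iq})^2} \cdot \sqrt{\sum_{i=1}^{t} (w_{ij})^2}} \qquad (4)$$
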
Table 2. VSM algorithm process 

Term               W_Q^2    W_D1^2   W_D2^2    W_D3^2    W_D4^2    W_Q*W_D1  W_Q*W_D2  W_Q*W_D3  W_Q*W_D4
Sungguh            0        0.09062  0         0         0.090619  0         0         0         0
Manusia            0.36248  0.36248  0         0         0         0.362476  0         0         0
Rugi               0.09062  0.09062  0         0         0.090619  0.090619  0         0         0.090619
Langit             0        0        0.362476  0         0         0         0         0         0
Gugusan            0        0        0.362476  0         0         0         0         0         0
Bintang            0        0        0.362476  0         0         0         0         0         0
Sembah             0        0        0         0.362476  0         0         0         0         0
Orang              0        0        0         0         0.362476  0         0         0         0
Sum                0.4531   0.54371  1.087429  0.362476  0.543714
SQRT               0.67312  0.73737  1.042798  0.60206   0.73737
Sum W_Q*W_D                                                        0.453095  0         0         0.090619
Rank (similarity)                                                  0.912871  0         0         0.182574
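
The two steps of the VSM calculation (vector lengths, then sum of products divided by the lengths) can be sketched in Python using the weights from Table 1. This is a minimal illustration under the definitions in (2) to (4), not the code of the system; the helper names `norm`, `cosine`, and `rank` are hypothetical, and the vectors are stored as sparse dictionaries with zero entries omitted.

```python
import math

# TF-IDF weights copied from Table 1 (zero entries omitted).
weights = {
    "Q":  {"manusia": 0.60206, "rugi": 0.30103},
    "D1": {"sungguh": 0.30103, "manusia": 0.60206, "rugi": 0.30103},
    "D2": {"langit": 0.60206, "gugusan": 0.60206, "bintang": 0.60206},
    "D3": {"sembah": 0.60206},
    "D4": {"sungguh": 0.30103, "rugi": 0.30103, "orang": 0.60206},
}


def norm(vec):
    """Vector length as in (2) and (3): square root of the sum of squared weights."""
    return math.sqrt(sum(w * w for w in vec.values()))


def cosine(q, d):
    """Similarity as in (4): sum of products divided by the product of the lengths."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    denom = norm(q) * norm(d)
    return dot / denom if denom else 0.0


def rank(weights, query_key="Q"):
    """Return (document, similarity) pairs with similarity > 0, highest first."""
    q = weights[query_key]
    scores = {
        name: cosine(q, vec)
        for name, vec in weights.items()
        if name != query_key
    }
    relevant = [(name, s) for name, s in scores.items() if s > 0]
    return sorted(relevant, key=lambda pair: pair[1], reverse=True)


for name, score in rank(weights):
    print(name, round(score, 6))
```

Documents with a similarity of zero (here D2 and D3) are dropped before sorting, which mirrors the rule that only results with a cosine value greater than 0 are shown; D1 comes first in the ranking, followed by D4.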
3.5. Testing 
This test measures the performance of the vector space model algorithm on 20 sample 
keywords using the recall metric [4]. The test results can be seen in Table 3. 


Table 3. Vector space model testing 

No.  Keyword                     Time      Found  Recall
1.   Bintang                     00.02.76  √      100%
2.   Perbintangan                00.03.32  √      100%
3.   Gunung sebagai pasak        00.02.43  √      83%
4.   Ilmu genetika               00.03.10  √      100%
5.   Komputer                    00.01.80  √      100%
6.   Manfaat siwak               00.02.15  √      100%
7.   Matahari yang menggerhana   00.02.86  √      81%
8.   Gerhana matahari            00.01.78  √      90%
9.   Besi berasal dari langit    00.02.33  √      81%
10.  Pengaruh gen keturunan      00.01.42  √      71%
11.  Tanaman                     00.02.22  √      100%
12.  Tujuh lapis bumi            00.01.67  √      81%
13.  Angin yang berhembus        00.02.48  √      100%
14.  Laut yang mendidih          00.02.19  √      66%
15.  Meteor                      00.02.36  √      100%
16.  Orbit bumi                  00.02.15  √      83%
17.  Tulang-belulang             00.02.54  √      100%
18.  Gravitasi bumi              00.02.08  √      83%
19.  Nebula                      00.01.89  √      100%
20.  Asam nukleat                00.01.46  √      100%


Testing is done by calculating the percentage of recall [30], which measures the algorithm's ability to 
retrieve relevant information [4]. For each test keyword, recall is calculated as: 

$$Recall = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents in the database}} \times 100\%$$

The recall results show that the system succeeded in retrieving relevant information at a level of 
81%, with an average time of 2.24 seconds. The failure of the system to retrieve some 
information may be caused by the capability of the local server and the amount of data processed [31]. 
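
The recall calculation for a single keyword can be sketched in Python as follows. The document identifiers below are hypothetical and serve only as an illustration; they are not taken from the test data, and the function name `recall_percent` is an assumption.

```python
def recall_percent(retrieved, relevant):
    """Recall in percent: relevant documents retrieved / relevant documents in the database."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    hits = len(relevant & set(retrieved))
    return 100.0 * hits / len(relevant)


# Hypothetical identifiers: six relevant documents exist, five of them are retrieved,
# and one retrieved document ("x9") is not relevant.
relevant_ids = {"v1", "v2", "v3", "v4", "v5", "v6"}
retrieved_ids = ["v1", "v2", "v3", "v4", "v5", "x9"]

print(round(recall_percent(retrieved_ids, relevant_ids)))
```

Irrelevant documents in the result list do not lower recall; they would only affect precision, which is not measured here.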


4. CONCLUSION 

In this paper, the vector space model algorithm is applied to the system through four stages, namely 
preprocessing, calculation of document and query weights, calculation of the cosine of the angle between 
the document and query vectors, and ranking of documents. Based on the test results using 20 sample 
keywords, it can be concluded that the vector space model algorithm provides good performance in retrieving 
verses of the Qur'an or hadiths relating to science and technology, with a recall of 81% and an average 
time of 2.24 seconds. The vector space model algorithm cannot handle writing errors in the query, so future 
studies are expected to implement query expansion to get better search results. 